Abstract

Transcription factors (TFs) are proteins that bind to specific sites on the DNA and regulate gene 
activity. Identifying where TF molecules bind and how much time they spend on their target sites is 
key for understanding transcriptional regulation. It is usually assumed that the free energy of binding 
of a TF to the DNA (the affinity of the site) is highly correlated to the amount of time the TF remains 
bound (the occupancy of the site). However, knowing the binding energy is not sufficient to infer 
actual binding site occupancy. This mismatch between the occupancy predicted by the affinity and the 
observed occupancy may be caused by various factors, such as TF abundance, competition between 
TFs or the arrangement of the sites on the DNA. We investigated the relationship between the affinity 
of a TF for a set of binding sites and their occupancy. In particular, we considered the case of the lac
repressor (lacI) in E. coli and performed stochastic simulations of the TF dynamics on the DNA for
various combinations of lacI abundance in competition with TFs that contribute to macromolecular
crowding. Our results showed that for medium and high affinity sites, TF competition does not play
a significant role in genomic occupancy, except in cases when the abundance of lacI is significantly
increased or when a low-information content PWM was used. Nevertheless, for medium and low
affinity sites, an increase in TF abundance (of either lacI or other molecules) leads to an increase in
occupancy at several sites.

Keywords: facilitated diffusion, Position Weight Matrix, thermodynamic equilibrium, motif 
information content, molecular crowding 



1 Introduction 

A powerful key to understanding transcriptional regulation is the amount of time a regulatory binding 
site is occupied by a cognate transcription factor (TF). In particular, this 'occupancy' measure can 
be used to infer relative amounts of transcription of the target gene, and is therefore a more powerful 
comparative tool than simple sequence searches for 'preferred binding sites'. Transcription factors have 
specific affinities for each site on the DNA (computed from the binding energy between the TF protein 
and the DNA molecule at the target site) and it is often naively assumed that this affinity is sufficient
to predict the actual occupancy of TFs bound to the DNA (Segal and Widom, 2009). However, recent
studies have demonstrated that affinity alone is not always sufficient to accurately predict TF occupancy
(Kaplan et al., 2011).



Previous studies have shown that TF abundance can account for the correlation between the normalised
affinity and normalised occupancy ("normalised" here refers to setting the maximum observed
values to 1) (Berg and von Hippel, 1987; Djordjevic et al., 2003; Gerland et al., 2002; Roider et al., 2007;
von Hippel and Berg, 1986; Zhao et al., 2009), in the sense that increasing TF abundance increases the
number of occupied sites and that those additional sites are of decreasing affinity. This result was
explained by the fact that, once the high affinity sites get close to saturation, TF molecules will spend
more time bound to lower affinity sites. However, in those studies the spatial organisation of sites on
the DNA was disregarded. Such an assumption should predict occupancy for in vitro experiments such
as SELEX or PBM (Stormo and Zhao, 2010) (where there are only short DNA sequences and one TF
species), whilst in in vivo studies it could lead to biased predictions.

A popular approach to estimate occupancy is the statistical thermodynamics framework. This method 
computes the probability that, at equilibrium, one encounters a specific configuration of TF molecules
on the DNA (Ackers et al., 1982; Bintu et al., 2005a,b; Raveh-Sadka et al., 2009). A number of studies
consider a uniform affinity landscape for TFs or other DNA-binding proteins and focus on the occupancy
of a single site (or a few sites) in the context of a genome with otherwise constant affinity (Ackers et al.,
1982; Bintu et al., 2005a,b; Raveh-Sadka et al., 2009). However, TFs display a distribution of affinities to
the DNA (Gerland et al., 2002; Stormo, 2000) and, thus, the assumption of a uniform landscape becomes
restrictive (and can lead to biases in the results). Wasson and Hartemink (2009) considered non-uniform
affinity landscapes and investigated the relationship between the abundance of DNA-binding proteins and
their occupancy using a statistical thermodynamics model. Their results confirmed that, when increasing
TF abundance, low affinity sites display higher occupancy than that which would be predicted by affinity
alone. Furthermore, the addition of other DNA-binding proteins (histones in their case) leads to an overall
reduction in occupancy of the TFs of interest. Similarly, Kaplan et al. (2011) applied a combination of
a hidden Markov model and a thermodynamic framework and discovered that TF competition does not 
influence the observed occupancy significantly (at least in the case of their system). Nevertheless, they 
considered only the competition between various TF species and did not alter the abundance of their 
TFs of interest (they used the actual TF abundance that was experimentally measured). 

The main assumption of the statistical thermodynamic framework is that the system reaches equilibrium
and the transient time (the time to reach equilibrium) is negligible (Segal and Widom, 2009).
Nevertheless, there is still no proof that, in the case of the TF search process, equilibrium exists or is
reached fast enough to not affect the average behaviour. We use a stochastic simulation of the process
by which a TF 'searches' for its regulatory binding site by first binding non-specifically to the DNA
and then performing a one-dimensional random walk before eventually unbinding. This combination of
binding/unbinding to/from the DNA and one-dimensional random walk is known as a facilitated diffusion
mechanism (Berg et al., 1981) and it is evident that such a process is taking place inside the cell
(Elf et al., 2007; Hammar et al., 2012). The physical advantage of facilitated diffusion over a purely
three-dimensional diffusion or a purely one-dimensional random walk is a more rapid target site location;
see (Zabet and Adryan, 2012b). Simulating facilitated diffusion can overcome some of the limitations of
the statistical thermodynamics model by allowing 'exact' in silico measurement of the average occupancy
of TF binding sites under various parametrisations of the cellular state (e.g. concentrations of DNA
binding proteins), some of which will give rise to deviations from the predictions offered by the statistical
thermodynamics model. For example, Chu et al. (2009) demonstrate such deviations when they model
TFs as having non-uniform affinity landscapes.

Here, we used a stochastic simulator that models the facilitated diffusion mechanism and studied the
properties of a complete continuous DNA sequence (from the genome of E. coli K-12 (Riley et al., 2006))
being bound by both a cognate TF species (lacI in our case) and a non-cognate TF species (aimed to model
the presence of other proteins on the DNA which contribute to crowding on the DNA) (Zabet and Adryan,
2012a,c). This scenario mimics the behaviour of TF molecules in a live cell performing facilitated diffusion
in the search for their target sites. The TF molecules will not only compete with other molecules bound
to the DNA for sites, but during the one-dimensional random walk on the DNA, they will slide or hop
to nearby sites (Mirny et al., 2009) and also bypass other bound molecules (Hedglin and O'Brien, 2010;
Kampmann, 2004) which act as obstacles and create boundary effects (Segal and Widom, 2009).



Our results confirm that the addition of non-cognate TFs reduces the absolute occupancy of cognate 
TF binding sites, while their relative occupancy is influenced at relatively few (in the order of tens) low
and medium affinity sites, and is unaffected at high affinity sites. That is, for low affinity ("non-specific") 
and medium affinity sites, the addition of non-cognate TFs leads to significant differences between the 
predicted relative occupancy based on affinity (which we call affinity derived occupancy, or ADO) and 
the relative occupancy measured by stochastic simulation (which we call simulation derived occupancy, 
or SDO) at several sites, whilst for high affinity sites this relative binding pattern is unaffected. While 
the mismatch associated with low affinity sites should have little or no influence on gene regulation 
(unless the cognate TF molecules change conformation when bound to a functional high affinity site
(Marcovitz and Levy, 2011)), this may provide an explanation for the noise structure in actual genomic
profiles of TF occupancy (e.g. ChIP data). 

We further found that differences between ADO and SDO at medium and high affinity sites can arise if
the cognate TF abundance is significantly increased or if the information content of the PWM is low.
However, for normal bacterial TF abundances (usually in the range of 10-100 copies (Wunderlich and Mirny,
2009)), PWM information content (Stormo and Fields, 1998; Wunderlich and Mirny, 2009) and DNA
sizes (e.g., 4.6 Mbp (Riley et al., 2006)), the differences between the SDO and ADO are negligible
and binding energies are good indicators of occupancy. Nevertheless, in the case of eukaryotic systems,
their high TF abundances (> $10^4$ copies (Biggin, 2011)), their lower information content motifs
(Wunderlich and Mirny, 2009) and the amount of accessible DNA suggest that significant differences
between ADO and SDO are likely to occur. However, this increase in occupancy generated by the high
abundance of cognate TFs can be reduced, to a certain degree, by a high abundance of non-cognate TF
molecules in the system.



2 Materials and Methods 



We use GRiP (Zabet and Adryan, 2012c) to simulate facilitated diffusion of DNA-binding proteins around
the DNA, which allows parametrisation with affinity data and measures site occupancy. Briefly, GRiP
performs event driven stochastic simulations (Gillespie, 1976, 1977) of all molecules in the cell, which
are explicitly represented. Molecules perform both a three-dimensional diffusion in the cytoplasm
(nucleoplasm in the case of eukaryotic cells) and a one-dimensional random walk on the DNA. The
three-dimensional diffusion is modelled implicitly by simulating the Chemical Master Equation. This approach
was shown to display negligible error if fast rebinding to the DNA is also modelled (van Zon et al., 2006)
and, in GRiP, fast rebinding is modelled through the hopping mechanism of TFs on the DNA. In addition,
the model implements steric hindrance, in the sense that no base pair can be covered by two TFs
simultaneously (Hermsen et al., 2006). The complete set of parameters for the model was previously
presented in (Zabet and Adryan, 2012a) and can be found in Appendix A.
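
To make the idea of an event driven (Gillespie-style) simulation concrete, the following minimal sketch follows a single TF that switches between a free and a bound state. It is not GRiP and does not model sliding, hopping or steric hindrance; the function name and the rates `k_on` and `k_off` are hypothetical and chosen only for illustration.

```python
import random

def simulate_bound_fraction(k_on, k_off, n_events, seed=0):
    """Gillespie-style simulation of one TF switching between free and bound.

    k_on and k_off are hypothetical rates (per second) for binding to and
    unbinding from a single site; returns the fraction of time spent bound.
    """
    rng = random.Random(seed)
    bound = False
    time_bound = 0.0
    time_total = 0.0
    for _ in range(n_events):
        # Only one reaction is possible in each state, so the waiting time
        # to the next event is exponential with that reaction's rate.
        rate = k_off if bound else k_on
        dt = rng.expovariate(rate)
        if bound:
            time_bound += dt
        time_total += dt
        bound = not bound
    return time_bound / time_total

fraction = simulate_bound_fraction(k_on=2.0, k_off=1.0, n_events=20000)
expected = 2.0 / (2.0 + 1.0)  # equilibrium value k_on / (k_on + k_off)
print(f"simulated {fraction:.3f}, equilibrium {expected:.3f}")
```

For this two-state toy system the simulated fraction of time bound should approach k_on / (k_on + k_off) as the number of events grows, which gives a simple check on the sampling of waiting times.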

In this study, we consider the case of the lac repressor (lacI) TF in E. coli, with an association rate to the
DNA of $k^{assoc}_{lacI} = 2400\ s^{-1}$ (Zabet, 2012) and a specificity as modelled by the PWM in Figure 1.

[Figure 1: sequence logo; x-axis: Position (1 to 21).]

Figure 1. lacI sequence logo. The canonical lacI motif as generated from the three known high affinity
sites (Zabet, 2012).
In addition to lacI, the system explicitly represents non-cognate molecules in order to model
macromolecular crowding. Each non-cognate molecule covers 46 bp of DNA and is allowed to perform the
facilitated diffusion mechanism in a similar way to cognate molecules (Zabet and Adryan, 2012a). We
consider five levels of crowding, namely: (i) 0% ($TF^{nc} = 0$), (ii) 9% ($TF^{nc}_{09} = 10^4$ and
$k^{assoc}_{nc} = 2000\ s^{-1}$), (iii) 26% ($TF^{nc}_{26} = 3 \times 10^4$ and $k^{assoc}_{nc} = 2571\ s^{-1}$),
(iv) 42% ($TF^{nc}_{42} = 5 \times 10^4$ and $k^{assoc}_{nc} = 3600\ s^{-1}$) and (v) 55%
($TF^{nc}_{55} = 7 \times 10^4$ and $k^{assoc}_{nc} = 6000\ s^{-1}$). Note that, with the exception of the first case
(no crowding on the DNA), all cases display crowding which is within biologically plausible values (10%
to 50% (Flyvbjerg et al., 2006)).



Before proceeding to investigate the relationship between affinity derived occupancy (ADO) and 
simulation derived occupancy (SDO), we first need to describe the methods used to estimate these 
parameters. ADO is computed using the average time a TF molecule spends bound at a certain position
on the DNA as derived from an approximation of the binding energy (which is itself calculated from
the PWM score); see equation (3) in (Zabet and Adryan, 2012a). Briefly, the affinity predicted occupancy of
a TF bound at the $j$th nucleotide on the DNA is given by

$$\tau^{j}_{lacI} = \tau^{0}_{lacI} \exp\left( \frac{-E^{j}_{lacI}}{K_B T} \right) \qquad (1)$$

where $\tau^{0}_{lacI}$ is the average waiting time when bound at the $O_1$ site, $E^{j}_{lacI}$ is the binding
energy at position $j$ (which is equal to $E^{j}_{lacI} = -w^{j}_{lacI}$, where $w^{j}_{lacI}$ is the lacI PWM
score at the $j$th nucleotide), $K_B$ is the Boltzmann constant and $T$ the temperature. In (Zabet, 2012),
we computed $\tau^{0}_{lacI} = 1.18e-06$.

All ADO vs SDO plots consider log values that are normalised to the maximum ADO or SDO, 
respectively. For example, in the case of affinity predicted occupancy, we plot: 



$$\log\left( \frac{\tau^{j}_{lacI}}{\max_{j}\{\tau^{j}_{lacI}\}} \right) \qquad (2)$$
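
As an illustration of equations (1) and (2), the sketch below turns a list of PWM scores into log normalised ADO values. It assumes that PWM scores are expressed in units of $K_B T$ and that the logarithm is the natural logarithm; both are assumptions of this example, and the scores themselves are hypothetical.

```python
import math

def affinity_derived_occupancy(pwm_scores, tau0=1.18e-06, kbt=1.0):
    """Equation (1): waiting time at each position from its PWM score.

    Assumes PWM scores are expressed in units of K_B*T (kbt=1.0), so that
    E_j = -w_j gives tau_j = tau0 * exp(w_j / kbt).
    """
    return [tau0 * math.exp(w / kbt) for w in pwm_scores]

def log_normalised(values):
    """Equation (2): log of each value relative to the maximum."""
    top = max(values)
    return [math.log(v / top) for v in values]

# Hypothetical PWM scores for five positions (not taken from the paper).
scores = [2.0, 10.5, 4.0, 15.0, 7.5]
ado = affinity_derived_occupancy(scores)
log_ado = log_normalised(ado)
print([round(x, 2) for x in log_ado])
```

Because of the normalisation, the prefactor cancels and the log normalised ADO at a site is simply its PWM score minus the highest score (in units of $K_B T$), so the strongest site always maps to zero.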



While ADO is computed directly from the PWM (a priori to the simulations) the SDO (simulation 
derived occupancy) is based on the results of our stochastic simulations. There are several ways in which 
the SDO can be estimated and in the following section we compare these approaches to justify our choice. 



2.1 Measuring the occupancy 

There are three methods to estimate the observed occupancy, namely: 

1. Ensemble average - Perform a set of $X$ stochastic simulations with identical parameters, each
running for a time interval $T_s$ (chosen as adequate to reach a stationary behaviour) and record the
position of each molecule at the end of the simulation. Using these $X$ sets of positions, measure the
occupancy by computing the average amount of time the TF spends at each position (Kaplan et al.,
2011). [Note: this is effectively the result obtained from a ChIP experiment: the mean behaviour
within an ensemble of cells.]

2. Time average - Observe a single system for a much longer time interval $T_l$ and compute the
occupancy as the average amount of time the TF spends at each position (Zabet and Adryan, 2012a).
The time average can take less time to compute and, consequently, is an appealing method to
estimate occupancy. In live cells, the activity state of a gene is related to the proportion of time the
regulatory region is occupied and, thus, the time average may be a better indicator for biological
relevance than the ensemble average (Zabet and Adryan, 2012b). Nevertheless, if one wants to replicate
the result of ChIP experiments, then the ensemble average is more appropriate.

3. Hybrid average - Perform a set of $X$ stochastic simulations for a long time interval $T_l$. For each
simulation calculate the time average occupancy and then perform an ensemble average over all
time averages. At the population level, there is an ensemble average over the behaviour of all cells,
thus the hybrid average is a good indicator of the occupancy when investigating gene regulation at
population level.
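
The three estimators can be sketched on a toy system. The example below replaces the full simulation with a lazy random walk on a small ring of sites, where `stay_prob` is a hypothetical per-site probability of remaining bound for another step; the function names and parameter values are illustrative assumptions, not part of GRiP.

```python
import random

def run_walk(stay_prob, n_steps, rng):
    """Toy lazy random walk on a ring of sites (a stand-in for one simulation).

    stay_prob[j] is a hypothetical probability of remaining at site j for
    another step. Returns the final position and the fraction of steps
    spent at each site.
    """
    n_sites = len(stay_prob)
    pos = rng.randrange(n_sites)
    visits = [0] * n_sites
    for _ in range(n_steps):
        visits[pos] += 1
        if rng.random() >= stay_prob[pos]:
            pos = (pos + rng.choice((-1, 1))) % n_sites
    return pos, [v / n_steps for v in visits]

def ensemble_average(stay_prob, n_runs, t_short, rng):
    """Fraction of runs in which the walker ends at each site."""
    counts = [0] * len(stay_prob)
    for _ in range(n_runs):
        final_pos, _ = run_walk(stay_prob, t_short, rng)
        counts[final_pos] += 1
    return [c / n_runs for c in counts]

def time_average(stay_prob, t_long, rng):
    """Fraction of time spent at each site in one long run."""
    _, fractions = run_walk(stay_prob, t_long, rng)
    return fractions

def hybrid_average(stay_prob, n_runs, t_long, rng):
    """Mean over several runs of the per-run time average."""
    totals = [0.0] * len(stay_prob)
    for _ in range(n_runs):
        _, fractions = run_walk(stay_prob, t_long, rng)
        totals = [t + f for t, f in zip(totals, fractions)]
    return [t / n_runs for t in totals]

rng = random.Random(1)
stay = [0.5, 0.5, 0.95, 0.5, 0.5]  # site 2 plays the role of a strong site
ens = ensemble_average(stay, n_runs=2000, t_short=200, rng=rng)
tim = time_average(stay, t_long=20000, rng=rng)
hyb = hybrid_average(stay, n_runs=20, t_long=2000, rng=rng)
print(ens, tim, hyb, sep="\n")
```

In this single-walker toy system there is no crowding, so the three estimates are expected to agree up to sampling noise; the interest of the comparison in the text lies in the cases where they do not.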



The ergodic theorem states that the time average for long time intervals equals the ensemble average.
However, the ergodicity assumption breaks down in certain cases (e.g. the time average differs from the
ensemble average in multi-stable systems (Gillespie, 2000)). Thus, we need to investigate under what
conditions the ergodicity assumption breaks down within our system.

Figure 2(A) confirms that the time average, hybrid average and ensemble average measures for SDO
produce similar results. In this case, the system consists of a DNA molecule and one lacI TF and zero
non-cognates. In addition, one can observe that all measures for SDO display negligible differences from
ADO.



[Figure 2: three panels titled "1 lacI molecule and 0 non-cognate molecules", "20 lacI molecules and 0 non-cognate molecules" and "1 lacI molecule and 20 non-cognate molecules"; x-axis: log(normalised ADO); legend: ensemble, time and hybrid averages.]



Figure 2. Comparison between the ensemble, time and hybrid averages of SDO in a crowded
environment. We considered 1 Kbp of DNA, which contains the $O_1$ site (the strongest known binding
site for lacI, which is located at position 365,547-365,567 on the E. coli K-12 genome) and: (A) 1
lac repressor molecule and 0 non-cognate molecules, (B) 20 lac repressor molecules and 0 non-cognate
molecules and (C) 1 lac repressor molecule and 20 non-cognate molecules. We plotted the sites that
have a binding energy at least 30% of the highest value (577 strongest sites). (A) The ensemble average
is computed from $X = 2 \times 10^6$ independent simulations [blue circles]; the time average is computed by
running the simulations for $T_l = 3000$ s [red crosses]; and the hybrid average is computed by running
$X = 40$ independent simulations for $T_l = 3000$ s [green triangles]. (B) The ensemble average is computed
from $X = 1 \times 10^5$ independent simulations [blue circles]; the time average is computed by running
the simulations for $T_l = 150$ s [red crosses]; and the hybrid average is computed by running $X = 40$
independent simulations for $T_l = 150$ s [green triangles]. (C) The ensemble average is computed
from $X = 2 \times 10^6$ independent simulations [blue circles]; the time average is computed by running
the simulations for $T_l = 3000$ s [red crosses]; and the hybrid average is computed by running $X = 40$
independent simulations for $T_l = 3000$ s [green triangles]. Table 1 shows that the three measures for SDO
appear to have the same mean.

By increasing the copy number of the TF, the ensemble average and time average diverge. Figure 2(B)
models 20 lacI molecules and zero non-cognates, and it is clear that in some cases the time average values
(red crosses) diverge from their associated ensemble average values (blue circles) and hybrid average
values (green triangles). The more dramatic effect, however, is the significant deviation of SDO from
ADO for all three measures. This shows that for significantly increased TF copy number, whilst the
ergodicity assumption has begun to break down, the differences introduced are insignificant compared to
the increased SDO observed at a large number of sites.

The case of increased crowding on the DNA, as modelled by the addition of non-cognate TFs, is
shown in Figure 2(C). Here the cognate abundance is kept fixed to one molecule, while 20 non-cognates
are modelled. The figure shows that a significant increase in the number of non-cognates has a negligible
effect on all three measures of SDO.

Table 1 shows that in the case of naked DNA and one molecule of lacI, the three measurements for
SDO (ensemble, time and hybrid averages) have approximately the same mean. However, molecular
crowding on the DNA leads to deviations between ensemble and hybrid averages. In particular, in the
case of high abundance of cognate TFs - 20 molecules of lacI - we observed a mean increase of ~ 33% in
the hybrid average compared to the ensemble average, while in the case of high abundance of non-cognate
TFs - 20 non-cognate molecules - we observed a decrease of ~ 14% in the hybrid average compared to the
ensemble average. In addition, in Appendix B we show that, when the simulation time is increased, the
mean ratio of hybrid and ensemble averages tends to 1 and the deviations from the mean are reduced.





| | 1 lacI, 0 non-cognates: mean | 1 lacI, 0 non-cognates: p-value | 20 lacI, 0 non-cognates: mean | 20 lacI, 0 non-cognates: p-value | 1 lacI, 20 non-cognates: mean | 1 lacI, 20 non-cognates: p-value |
| log(time/ensemble) | -0.0132 | 0.1687 | -0.0148 | 0.1546 | 0.0788 | 2.65e-13 |
| log(hybrid/ensemble) | 0.0221 | 0.0212 | 0.0112 | 0.2800 | -0.1513 | 2.21e-[exponent illegible] |

Table 1. Mean and t-test p-values of log(time/ensemble) and log(hybrid/ensemble) averages of SDO
for three levels of crowding. The table shows the effect of crowding for different measures of occupancy.
The three measures are time average, ensemble average and hybrid average. The system model is as in
Figure 2. The log ratios of (time/ensemble) and (hybrid/ensemble) show significant deviations from zero
as measured by a standard one-sample t-test in the case of 1 lacI and 20 non-cognates. This demonstrates
that the ergodic theorem does not hold for this level of crowding as measured by the model.
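
The kind of test reported in Table 1 can be sketched as follows. The example computes the mean and the one-sample t statistic of per-site log ratios and, because the Python standard library has no t distribution, approximates the two-sided p-value with a normal distribution. That approximation, the function names and the synthetic data are all assumptions of this sketch and are not taken from the paper.

```python
import math
import random
import statistics

def one_sample_t(log_ratios):
    """Mean and t statistic for the null hypothesis of a zero mean log ratio."""
    n = len(log_ratios)
    mean = statistics.mean(log_ratios)
    std_err = statistics.stdev(log_ratios) / math.sqrt(n)
    return mean, mean / std_err

def two_sided_p_normal(t_stat):
    """Two-sided p-value using a normal approximation to the t distribution.

    This approximation is an assumption of the sketch; it is reasonable only
    for large samples.
    """
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t_stat)))

# Synthetic per-site occupancies (hypothetical, not data from the paper):
# the hybrid values are the ensemble values shifted down on average.
rng = random.Random(0)
ensemble = [rng.uniform(0.01, 1.0) for _ in range(500)]
hybrid = [e * math.exp(rng.gauss(-0.15, 0.5)) for e in ensemble]

log_ratios = [math.log(h / e) for h, e in zip(hybrid, ensemble)]
mean, t_stat = one_sample_t(log_ratios)
p_value = two_sided_p_normal(t_stat)
print(f"mean log ratio {mean:.4f}, t {t_stat:.2f}, p {p_value:.3g}")
```

A mean log ratio close to zero with a large p-value would be consistent with the two estimators agreeing, whereas a clearly non-zero mean with a small p-value indicates a systematic shift between them.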



Due to the fact that we are interested in genomic occupancy of TFs that are involved in the regulation 
of transcription and that, in particular, we are interested in cell population results, we use the hybrid 
average in all subsequent calculations within this manuscript. Nevertheless, it should be noted that using 
any of the three methods will lead to similar results. 



2.2 System size reduction 

Our results are obtained by simulating TF occupancy on 100 Kbp of the E. coli K-12 genome
(Riley et al., 2006) (the DNA locus [300000, 400000]), roughly centered around the $O_1$ site (the most
strongly bound site for lacI). In (Zabet, 2012), we proposed two models that are required to adapt the
parameters of the subsystem, namely: (i) copy number model and (ii) association rate model. The former
is easier to implement, but can be applied only to highly abundant TFs, while the latter requires an extra
set of simulations, but can be applied to TFs with any abundance. Due to the fact that non-cognate TFs
are highly abundant in our system, we applied the copy number model to simulate the non-cognate TFs.
This leads to the association rate between non-cognate TFs and DNA being unaffected, but the
abundances of non-cognate TFs changing to: (i) $\overline{TF}^{nc} = 0$ for 0% crowding, (ii)
$\overline{TF}^{nc}_{09} = 216$ for 9% crowding, (iii) $\overline{TF}^{nc}_{26} = 647$ for 26% crowding,
(iv) $\overline{TF}^{nc}_{42} = 1078$ for 42% crowding and (v) $\overline{TF}^{nc}_{55} = 1509$ for
55% crowding. Note that, in this manuscript, crowding refers to the percentage of the simulated DNA
covered by DNA-binding proteins.
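
A minimal sketch of the copy number adjustment is given below. It assumes that the abundance is scaled in proportion to the simulated fraction of the genome and uses the length of the E. coli K-12 MG1655 genome (4,639,675 bp); both the proportional rule and the function name are assumptions of this example rather than a statement of the method in (Zabet, 2012).

```python
GENOME_BP = 4_639_675   # E. coli K-12 MG1655 genome length (assumed here)
SUBSYSTEM_BP = 100_000  # simulated locus [300000, 400000]

def subsystem_copy_number(full_copy_number, sub_bp=SUBSYSTEM_BP, genome_bp=GENOME_BP):
    """Scale an abundance by the fraction of the genome that is simulated."""
    return round(full_copy_number * sub_bp / genome_bp)

for full in (0, 10_000, 30_000, 50_000, 70_000):
    print(full, subsystem_copy_number(full))
```

Under these assumptions the scaled abundances are 0, 216, 647, 1078 and 1509, which match the subsystem values listed above.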
For lacI, we considered four abundances, namely: 1, 10, 100, 1000. Due to the lower copy number, we
used the association rate approach to adjust the parameters of the full system to the subsystem. This
leads to the copy number of lacI being unaffected, but its association rate changing from
$k^{assoc}_{lacI} = 2400\ s^{-1}$ (Zabet, 2012) to the values listed in Table 2. In Appendix C we plotted the
proportion of time spent on the DNA (which is required when computing the association rate) and also
confirmed that our system size reduction method leads to a system behaviour that deviates only negligibly
from the behaviour of the full system.
| Crowding on the DNA | $\overline{k}^{assoc}_{lacI}$ ($s^{-1}$), 1 lacI | $\overline{k}^{assoc}_{lacI}$ ($s^{-1}$), 10 lacI | $\overline{k}^{assoc}_{lacI}$ ($s^{-1}$), 100 lacI | $\overline{k}^{assoc}_{lacI}$ ($s^{-1}$), 1000 lacI |
| 0% | 4.19 | 4.04 | 4.11 | 4.19 |
| 9% | 4.58 | 4.63 | 4.67 | 4.74 |
| 26% | 6.11 | 6.10 | 6.19 | 6.32 |
| 42% | 8.63 | 8.76 | 8.73 | 8.88 |
| 55% | 13.15 | 13.05 | 13.06 | 13.26 |

Table 2. The association rate of lacI in the 100 Kbp subsystem for various crowding levels on the
DNA. The overbar is used to denote the corresponding parameters in the subsystem.



3 Results 



In (Zabet and Adryan, 2012a), we found that, under certain conditions, the occupancy in the simulations
cannot always be predicted based on the affinity. To systematically assess the source of the mismatch
between affinity derived occupancy (ADO) and simulation derived occupancy (SDO), we considered the
case of a bacterial TF (lacI) with biologically plausible parameters and investigated the relationship
between affinity and occupancy. Figure 3 contains scatter plots of the SDO vs. ADO at individual sites
(at 1 bp resolution) for various crowding levels on the DNA, and various lacI abundances. To eliminate
weak sites which will not facilitate the formation of a strong complex with lacI, we recorded only sites
with high affinity $E^{j}_{lacI} > E^{max}_{lacI} \times 0.7$. We chose this threshold to select the top 0.5% of
sites based on the distribution of binding energies, but the value of the threshold can be selected to match
any distribution of binding energies.

Figure 3(A) shows that for 1 lacI molecule, there is an excellent agreement between ADO and SDO
even in the case of crowding on the DNA. The mean ratio of SDO to ADO for 1 lacI molecule with 26%
crowding is 0.966, within a 95% confidence interval (0.825,1.120). This suggests that, even in the case
of leaky gene expression (1 or a few TF molecules), the TF is able to regulate a gene within a cell cycle
and the percentage of time the site is occupied is not affected by crowding.

Usually, bacterial TFs number between 10 and 100 copies per cell (Wunderlich and Mirny, 2009). In
this case, as well as in the case of 1 lacI molecule, the addition of non-cognate TFs does not appear to
introduce a significant difference between ADO and SDO.

Finally, a few bacterial TFs are known to exist in high copy numbers (e.g. the copy number of CRP
is $\approx 1000$ (Santillán and Mackey, 2004)) and Figure 3(A) confirms that, in the case of highly abundant
bacterial TFs, the ADO diverges from the SDO. In particular, we observed a two-fold increase in SDO,
compared to ADO; see Table 3. This indicates that certain sites (for example $O_2$, the second strongest
site of lacI) will display a higher degree of occupancy than that predicted by affinity.

Next, we considered the effect of increased crowding of the DNA by non-cognates on the relationship
between ADO and SDO. Figure 3(B) shows that increasing the crowding level has a negligible effect on
this relationship and that ADO is a good approximator of SDO at all levels of non-cognate crowding
when 10 lacI molecules are modelled; see also Table 4.
Figure 3. ADO and SDO for various abundances of lacI and crowding on the DNA. We
considered the case of the lac repressor TF and 100 Kbp of DNA, which contain the $O_1$ site. Each
system was simulated for $T_l = 3000$ s (which is the average cell cycle time of E. coli (Rosenfeld et al.,
2005; Santillán and Mackey, 2004)) and, for each set of parameters, we considered $X = 40$ independent
simulations. We considered only the sites that have the binding energy at least 70% of the highest value
(the strongest 437 sites). (A) Five different lacI copy numbers: (i) 1, (ii) 10, (iii) 100, (iv) 1000 and (v)
10000. We assumed the case of $3 \times 10^4$ copies of non-cognate TFs, which lead to 26% of the DNA being
covered. (B) Five different non-cognate copy numbers: (i) 0, (ii) $1 \times 10^4$, (iii) $3 \times 10^4$, (iv) $5 \times 10^4$ and
(v) $7 \times 10^4$, and 10 copies of lacI.



| lacI copies | 1 | 10 | 100 | 1000 | 10000 |
| mean | 0.966 | 1.081 | 1.090 | 1.950 | 9.782 |
| 1 | - | (0.108,0.123) | (0.117,0.131) | (0.973,0.995) | (8.680,8.950) |
| 10 | - | - | (0.006,0.012) | (0.860,0.877) | (8.570,8.830) |
| 100 | - | - | - | (0.851,0.868) | (8.560,8.820) |
| 1000 | - | - | - | - | (7.700,7.970) |

Table 3. Confidence intervals around change in ratio SDO/ADO with 26% crowding. 95% t-test
confidence interval for the difference in mean ratio SDO/ADO between abundances of lacI transcription
factor. For example, moving from 1 lacI copy to 1000 copies sees the confidence interval at (0.880,0.909)
- in other words the mean ratio has shifted by nearly 1. This is reflected in the raw mean values for 1
copy and 1000 copies of 1.066 and 1.960 respectively.



Altogether, non-cognate binding proteins do not affect the occupancy of medium and high affinity 
sites, in the sense that the SDO of medium and high affinity sites is accurately approximated by the 
ADO. However, by significantly increasing the abundance of cognate TFs, ADO ceases to be a good 
approximator of the SDO of medium and high affinity sites. Thus, only cognate abundance influences 
the occupancy of medium and high affinity sites, while non-cognate TFs have only limited effect. 

The results shown in Figure 3 use normalised measures of occupancy (ADO and SDO), which are the
| % of covered DNA | 0% | 9% | 26% | 42% | 55% |
| mean | 1.010 | 0.968 | 1.081 | 0.993 | 1.066 |
| C.I. | (0.008, 0.012) | (-0.035,-0.030) | (-0.076,-0.080) | (-0.011,-0.005) | (0.059,0.067) |

Table 4. Effect of crowding on ratio SDO/ADO for 10 lacI molecules. The table shows the mean
SDO/ADO ratio for different levels of crowding. Confidence intervals are from a 95% t-test and show
shift in mean ratio from 0% crowding level.



relative values with respect to the highest rate of occupancy at the strongest site. When analysing the
absolute values for occupancy, Wasson and Hartemink (2009) observed that the addition of non-specific
DNA binding proteins (nucleosomes in their studies) will reduce the absolute occupancy of cognate TFs.
In Appendix D we show that the SDO increases when the lacI abundance is increased and slightly decreases
when the non-cognate abundance is increased, supporting the results from (Wasson and Hartemink,
2009).



3.1 Non-specific sites 

Figure 3 considers only sites with an affinity above a specific threshold. Besides providing more clarity,
the rationale for this restriction was twofold: First, there is no clear evidence for the biological relevance
of extreme low affinity sites, and second, we are only interested in amounts of occupancy that would be
detectable in a biochemical assay (i.e. extreme low affinity binding events are likely not detectable), as
the theoretical explanation of observed binding profiles is one of the goals of our research.

Figure 4 shows heatmaps representing the number of sites where the ratio between SDO and ADO is
higher than a factor $\delta$, i.e. SDO/ADO > $\delta$. For example, when $\delta > 1$, the graph considers the sites
where occupancy predicted from affinity underestimates the occupancy observed in the simulations. Interestingly,
we did not find any sites where the SDO is lower than the ADO (which we call 'false negative' sites),
under the various combinations of lacI abundances and crowding levels on the DNA (data not shown).

However, we found sites where SDO > ADO and we call these sites 'false positives'. For lacI
abundances within [1,100] copies - Figures 4(A-C) - there are tens of sites where the SDO is higher by at least
50% compared to the ADO ($\delta > 1.5$). These sites appear only for high levels of crowding (at least 42%)
and their number is increased by increasing the crowding. This means that by increasing the crowding
on the DNA the number of sites where SDO is higher than ADO also increases. We also investigated
if there is a particular affinity of the sites where the SDO exceeds ADO and found that these sites are
usually distributed amongst the medium and non-specific sites; see Appendix F.

When we looked for larger differences between SDO and ADO, we observed fewer false positive sites as δ
increased. In particular, for [1, 100] copies of lacI, there is no site where the occupancy in
the simulations is higher by 150% (i.e. δ > 2.5) than the value predicted by the affinity. This supports
the conclusion from the previous section that the occupancy we observed in the simulations does not
significantly deviate from that predicted based on the affinity.
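The counting behind Figure 4 can be sketched in R as follows. This is a minimal illustration, assuming two hypothetical numeric vectors, `sdo` and `ado`, that hold the normalised simulation-derived and affinity-derived occupancies of the same sites; the function name `countFalsePositives` and the example values are introduced here for illustration and are not part of the original analysis.

```r
# Count the sites whose SDO exceeds the ADO by more than a factor delta.
# 'sdo' and 'ado' are hypothetical vectors of equal length (one entry per site).
countFalsePositives <- function(sdo, ado, delta) {
  stopifnot(length(sdo) == length(ado), all(ado > 0))
  ratio <- sdo / ado
  sum(ratio > delta)
}

# Illustrative values only (not taken from the simulations).
ado <- c(1.00, 0.50, 0.10, 0.05, 0.01)
sdo <- c(1.00, 0.55, 0.16, 0.11, 0.03)

# Number of 'false positive' sites for a range of thresholds.
deltas <- c(1.5, 2.0, 2.5)
counts <- sapply(deltas, function(d) countFalsePositives(sdo, ado, d))
names(counts) <- deltas
print(counts)
```

As in the text, the count can only stay the same or fall as δ is raised, because every site that passes a stricter threshold also passes a looser one.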

In the case of 1000 copies of lacI, the results differ. Specifically, there appear to be two regimes,
namely: (i) δ ≤ 2 and (ii) δ > 2. In the first of these (δ ∈ [1.5, 2.0]), increasing the number of
non-cognate molecules reduces the number of sites where SDO/ADO > δ. In other words, in this regime,
increased crowding on the DNA has the opposite effect to that for lower lacI copy numbers (see above):
it reduces the number of false positive sites. In the case of 1000 copies of lacI, the mean SDO/ADO ratio
is δ_r ≈ 2 (whilst when lacI abundance ≤ 100 copies it is approximately 1) and by adding non-cognates
the number of bound cognate molecules at sites whose SDO/ADO < δ_r is reduced (see Appendix E). In
turn, the mean SDO/ADO ratio is reduced, which explains why the number of false positive
sites decreases. In the latter case (δ ∈ (2.0, 2.5]), we observe a similar effect as for lower abundances of
lacI, namely that increasing the crowding on the DNA increases the number of bound cognate molecules
at sites where SDO/ADO > δ_r.

Figure 4. Significant deviations between ADO and SDO. In this graph, we did not consider any
affinity cut-off and plotted the number of sites where the ratio between SDO and ADO exceeds δ for a
range of values of δ ∈ [1.5, 2.5]. There are four cases: (A) 1 lacI molecule, (B) 10 lacI molecules, (C) 100
lacI molecules and (D) 1000 lacI molecules.



3.2 Considerations on eukaryotic cells 

Eukaryotes typically have 3 × 10^4 TF copies per cell (Biggin, 2011), with some abundances being as high
as 3 × 10^6 copies per cell (Biggin, 2011). This higher abundance of TFs compared to prokaryotes appears
to reflect that eukaryotic genomes are much longer, giving much greater space in which TFs can bind
(Kaplan et al., 2011). However, at any one time large parts of the eukaryotic genome are packed into dense
chromatin, and are thus inaccessible to TF binding. For example, in the D. melanogaster embryo, on
average only 4.1 Mbp of the euchromatic genome of 118 Mbp is accessible during each early developmental
stage (Thomas et al., 2011). This means that, in such eukaryotic cells, we have accessible DNA that is
similar in length to that considered in this study (the E. coli genome is approximately 4.6 Mbp), but with
TFs in much greater abundance. This raises the question of whether the relationship between occupancy
and affinity that we observe when simulating the prokaryotic case (lacI around the O1 site) is still true
in the context of eukaryotic systems with TFs that have ~10^4 copies or more.

It is clear from Figure 3 that increasing the abundance of cognate TFs up to 10^4 increases the number
of medium affinity sites that display significantly higher occupancy; see also Table 3. This observation
remains true for different levels of crowding on the DNA as introduced by the presence of non-cognate
TFs (no crowding, low crowding and medium crowding) (data not shown). Furthermore, at such high
levels of cognate abundance almost all sites display a much higher occupancy than that predicted from
their affinity. For example, the occupancy of the second strongest site of lacI (O2) becomes approximately
equal to that of the strongest one (O1), although there is a large difference in affinity between the two
sites. This observation suggests that high TF abundance makes strong and weak sites less distinguishable,
which would hinder a quantitative readout for the regulation of gene expression in the cell.
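A simple equilibrium argument illustrates why high abundance can blur the distinction between strong and weak sites. The sketch below is not the stochastic model used in the simulations; it assumes an isolated site with binding energy E (in units of K_B T, more negative meaning stronger binding), no competition, and a hypothetical lumped constant `k0`, so that the occupancy is p = N·exp(−E) / (k0 + N·exp(−E)). The energies assigned to the two sites and the value of `k0` are illustrative assumptions only.

```r
# Equilibrium occupancy of an isolated site (two-state approximation).
# n  : number of TF molecules (abundance)
# e  : binding energy in units of K_B T (more negative = stronger binding)
# k0 : hypothetical lumped constant setting the abundance scale
siteOccupancy <- function(n, e, k0 = 1e6) {
  w <- n * exp(-e)
  w / (k0 + w)
}

# Illustrative energies for a strong and a weaker site (assumed values).
eStrong <- -16
eWeak   <- -13

abundance <- c(1, 10, 100, 1000, 10000)
occStrong <- siteOccupancy(abundance, eStrong)
occWeak   <- siteOccupancy(abundance, eWeak)

# Ratio of weak to strong occupancy: approaches 1 as both sites saturate.
relOcc <- occWeak / occStrong
print(round(data.frame(abundance, occStrong, occWeak, relOcc), 3))
```

Under these assumptions the relative occupancy of the weaker site rises monotonically with abundance, which is the qualitative behaviour described above for O2 relative to O1.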

Above, we considered occupancy and affinity at single nucleotide resolution. Figure 5 shows a theoretical
TF binding profile over a locus of the E. coli genome as calculated using GRIP, demonstrating
the progressive effect on occupancy of increasing TF abundance. (The theoretical profiles are generated
using a method described by Kaplan et al. (2011) for modelling ChIP-seq profiles; see Appendix G.) Each
chart plots the ADO and SDO, and shows that for low copy numbers ([10, 100] copies per cell), the
profile of the ADO (filled region) matches the profile of SDO (solid line) with high accuracy for the
cases of no crowding on the DNA (0 non-cognate molecules) and medium crowding on the DNA (3 × 10^4
non-cognate molecules). This would imply that, in bacterial cells (i.e. when TF abundance is relatively
low), the binding of TFs to their target sites is not affected by competition with other molecules, and
occupancy is predominantly a function of, and is accurately modelled by, affinity. However, when TFs are
highly abundant ([10^3, 10^4] copies per cell), as is common in eukaryotic systems, the level of affinity is
not the sole determinant of occupancy on the DNA. In other words, the amount of time spent bound is
determined not just by the encoded information in the DNA (nucleotide composition of binding sites)
and DNA accessibility, but also by the abundance of TFs in the system (mainly cognate TF abundance, but
small effects from non-cognates were observed).

Finally, bacterial TFs have PWMs with higher information content compared to eukaryotic TFs
(Stormo and Fields, 1998; Wunderlich and Mirny, 2009), e.g., for lacI, I_lacI = 16.9 bits. To investigate
the influence of information content on the number of highly occupied sites observed in the simulations, we
removed positions from the end of the lacI motif and performed the simulations at various abundances of
lacI on naked DNA (i.e. no non-cognate TF molecules). In total, we considered six cases, which resulted
in the information content of the reduced lacI motif being: (i) I_lacI1 = 15.8, (ii) I_lacI2 = 14.7, (iii)
I_lacI3 = 12.7, (iv) I_lacI4 = 10.7, (v) I_lacI5 = 8.7 and (vi) I_lacI6 = 7.7 bits; see Appendix H. Figure 6 shows that,
by selecting an arbitrary threshold (a certain percentage of the highest value of SDO), the number of sites with
SDO higher than the threshold increases both as the abundance of lacI increases (compare the values
on each row in Figure 6), and as the information content of the motif decreases (compare the values on
each column in Figure 6). Note that the former (the dependence of the SDO on the TF abundance) was
already shown in Figure 3 and Figure 5. Hence, in eukaryotic systems, we can expect the number of sites
with high SDO to increase for two reasons: the greater TF abundance (Biggin, 2011) and the
likely lower information content of the average eukaryotic PWM (Wunderlich and Mirny, 2009).
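For reference, the information content of a motif and the effect of truncating it can be sketched as follows. This is a minimal illustration assuming a uniform background (0.25 per nucleotide) and a hypothetical 4 × L matrix of nucleotide probabilities; the matrix below is made up and is not the lacI PWM, and `infoContent` is a name introduced here for illustration.

```r
# Information content (in bits) of a motif given as a 4 x L probability matrix,
# assuming a uniform background: I = sum_j (2 + sum_b p_bj * log2(p_bj)).
infoContent <- function(pm) {
  stopifnot(nrow(pm) == 4, all(abs(colSums(pm) - 1) < 1e-9))
  plogp <- ifelse(pm > 0, pm * log2(pm), 0)   # treat 0 * log2(0) as 0
  sum(2 + colSums(plogp))
}

# Hypothetical 5-column motif (rows: A, C, G, T); values are illustrative.
pm <- matrix(c(1.00, 0.00, 0.00, 0.00,
               0.00, 1.00, 0.00, 0.00,
               0.50, 0.50, 0.00, 0.00,
               0.25, 0.25, 0.25, 0.25,
               0.00, 0.00, 0.00, 1.00),
             nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))

# Successively remove the rightmost column and recompute the information content.
ic <- sapply(ncol(pm):1, function(k) infoContent(pm[, 1:k, drop = FALSE]))
print(ic)
```

Removing a column can never increase the information content under this definition, and the size of each drop depends on how informative the removed column was, which is one way in which truncating from one end reduces the information content unevenly.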

Note that by removing certain positions from the end of the lacI motif, we reduced the information
content in a biased way and this can lead to small variations in the occupancy, particularly in the
case when there are few sites that display high occupancy. Nevertheless, this approach to changing
the information content does not influence the general result, that TFs with lower information content
motifs display a more dramatic change in the number of highly occupied sites compared to TFs with higher
information content motifs.

Figure 5. SDO and ADO landscape for various cognate and non-cognate abundances. We
considered the case of the lac repressor TF and 100 Kbp of DNA, which contain the O1 site. In each
chart the solid green line is the SDO at one of four levels of lacI abundance, and the filled green region
is the ADO. The SDO shown is calculated with 0 non-cognate molecules; calculations for 10% and 26%
non-cognate abundance show no visible deviation from the 0 non-cognate case (hence not shown). The
SDO was calculated at four lacI abundances: (A) 10, (B) 100, (C) 1000 and (D) 10000 molecules. Each
system was simulated for 3000 s and for each set of parameters we consider X = 40 independent
simulations. We considered only the sites that have a binding energy of at least 70% of the highest value
(the strongest 437 sites). We converted the single nucleotide resolution into expected ChIP-seq profiles
as proposed in (Kaplan et al., 2011); see Appendix G.

Figure 6. The relationship between information content of the PWM motif and the abundance
of TF. This graph represents the number of sites that display an occupancy in the simulation
that is higher than the following thresholds: (A) 0.25 · max(SDO), (B) 0.50 · max(SDO) and (C)
0.75 · max(SDO). There were no non-cognate TFs in these cases and occupancy was calculated at
abundances of lacI ∈ {1, 10, 100, 1000, 10000}. Information content of the lacI motif was reduced by successively
removing the rightmost column of the PWM (see Appendix H). In general the number of high occupancy
sites is increased by both increased lacI abundance (compare the values on each row) and reduced
information content (compare the values on each column). In (B) at the highest lacI abundance, there are
several cases where the number of highly occupied sites decreases with reducing the information content
(from 16 to 8), contrary to the pattern at other abundances and/or thresholds. This can be explained by
the fact that, in order to reduce the information content, we removed certain base pairs from the lacI
motif, which can introduce biases in the affinity landscape. These biases can lead to small deviations from
the expected results, particularly in the cases where there are few sites and the TF has high abundance.
For example, in the case of 10000 copies of lacI with the full motif, there are sites that display an
occupancy of 0.6 · max(SDO), while, in the case of 10000 copies of lacI with information content 14.7,
those sites will display an occupancy of 0.4 · max(SDO).



4 Discussion 

Transcription factors perform a combination of three-dimensional diffusion and one-dimensional random 
walk on the DNA when they search for their target sites. Inherently, this mechanism leads to the binding 
of TFs not only to their target sites, but also to other, lower affinity sites on the DNA. In this context, 
it becomes important to understand the relationship between affinity (how strongly a TF binds to a site 
on the DNA) and occupancy (the residence time of a TF on a site). 

Often it is assumed that the relative occupancy of a TF measured experimentally (say, in a ChIP assay) 
is indicative of the relative affinity, and many studies infer a TF's affinity by de novo motif analysis based 
on the most highly occupied sites (those showing the strongest ChIP enrichment). This assumption is 
flawed when there is divergence between occupancy and affinity for these highly occupied sites. Although
this approximation proved to have good accuracy in the inference of position weight matrices in many
cases (e.g., Adryan et al., 2007), there are also examples where the method seems to fail (e.g., Zeitlinger
et al., 2007). These cases refer to situations where false positive predictions (sites that have low affinity
but display high occupancy) or false negative predictions (sites that have high affinity but display low
occupancy) could have influenced the success of the study.

Our results indicate that by adding non-cognate TFs, the absolute occupancy of binding sites by
cognate TF molecules is reduced (see Appendix D). The reduction in the absolute value of the occupancy
is a consequence of the competition of TFs for the limited amount of DNA. Wasson and Hartemink (2009)
observed the same effect, although they used a different approach (a statistical thermodynamics model)
to estimate the occupancy. However, in their study, they did not look at the occupancy relative to the
highest value (the quantitative readout of binding events).

We found that the abundance of non-cognate TFs has a limited effect on the normalised occupancy
of low, medium and high affinity sites; see Figure (A-B) and Figure 3. Nevertheless, there are several
sites (in the order of tens), where the addition of non-cognate TFs leads to significant deviations of the
observed occupancy derived from simulation (SDO) from that derived from affinity (ADO). This result
is supported by recent experimental evidence, where the authors showed that lac repressor occupancy
increases at lower affinity sites (far away from the O1 site) when the crowding in the cell increases (and, thus,
the crowding on the DNA increases as well) (Kuhlman and Cox, 2012).

Bacterial TFs are expressed at low copy numbers (between 10 and 100) (Wunderlich and Mirny, 2009)
and they have only a few strong sites that are highly specific (Stormo and Fields, 1998; Wunderlich and Mirny,
2009). This suggests that, in the case of bacterial gene regulation, affinity controls the relative occupancy
of the specific sites (acting as a local fine tuning mechanism), while the crowding level on the DNA
controls the global occupancy of the sites (acting as a global regulator).

We also investigated under which conditions the normalised occupancy of the medium and high affinity
sites is affected. Our results confirmed that for TFs with 10^3 to 10^4 copies per cell and approximately
4 Mbp of available DNA, the occupancy is higher than that predicted by affinity, irrespective of the
abundance of non-cognate TFs. Eukaryotic systems have TFs with high abundance (on average 3 × 10^4
copies per cell) (Biggin, 2011) and although they have much larger genomes, only a small proportion of
this is accessible to TFs (e.g., ≈ 4 Mbp in early developmental stages of D. melanogaster) (Thomas et al.,
2011). This suggests that the rate of false positive binding events (higher occupancy than predicted by
affinity) is significant in eukaryotic cells; see Figure 5. Note that our model is applicable only to TFs
residing in the nucleoplasm and, thus, when we mention TF abundance in eukaryotic systems we refer to
nuclear abundance of TFs (Fowlkes et al., 2008).



Kaplan et al. (2011) investigated the relationship between experimentally measured occupancy (from
ChIP-seq experiments) and that predicted using a hidden Markov model, and found that the highest
correlation between the two was on average ~0.7. To achieve this correlation they assumed real TF
abundances that were previously measured in D. melanogaster nuclei (Fowlkes et al., 2008), but they did
not adapt the abundances of TFs to the size of the analysed DNA segment. In (Zabet, 2012), we showed
that, when the number of bound TF molecules is not changed in such a subsystem (a simulated entity
smaller than the genome), the correlation coefficient between the occupancy of the full system and the
occupancy of the subsystems can be as low as 0.4. This result is also shown in Figures 3 and 5, which
confirm that an increase in cognate TF copy number can lead to a reduction in the correlation between
occupancy and affinity landscape. Thus, one method to increase the correlation between the predicted
and observed occupancy consists of adapting the abundance levels of the TFs with one of the methods
presented in (Zabet, 2012).

In addition, this higher number of highly occupied sites is also influenced by the information content
of the motif. In Figure 6 we showed that, by reducing the information content, the number of sites with
high SDO increases, but also that the effect of the increase in TF abundance on the highly occupied
sites is more dramatic. In other words, by increasing the abundance of a TF with a PWM with lower
information content, we observed a larger increase in the number of highly occupied sites compared to
the case of a TF with a PWM with higher information content; compare different rows in Figure 6. This
suggests that, in the case of eukaryotic systems (which have TFs with lower information content PWMs
(Wunderlich and Mirny, 2009) and higher abundances (Biggin, 2011)), the effect of TF abundance on
the number of 'false positive' sites is more severe than in the case of bacterial cells.

Our approach to reducing the information content (by removing positions from the end of the lacI motif)
is prone to introducing biases in the results, in particular at high abundance of the TF and low number
of highly occupied sites; see Figure 6(B). A different approach to reducing the information content could
be to add non-specific sites uniformly when constructing the PWM, but we anticipate this would lead
to similar results, namely: in the case of lower information content motifs, a change in the abundance
of TF has more drastic effects on the number of highly occupied sites, compared to the case of higher
information content motifs. Nevertheless, the details of applying a different approach to reduce the
information content are left for further research, as they are beyond the scope of this manuscript.

Finally, we found that the increase in occupancy caused by the addition of cognate molecules can be
reduced by adding non-cognate molecules. Figure shows that while, in the case of empty DNA,
most of the sites display an occupancy in the simulations that is higher by at least 100% than that
predicted from affinity, in the case of high crowding on the DNA, only several hundred sites display such
a difference between SDO and ADO. However, this difference is still large, in the order of 70%.
APPENDIX



A TF parameters 



The default parameters used here were previously derived in (Zabet and Adryan, 2012a) and (Zabet,
2012) and are listed in Table 5.

The PWM of lacI was presented in (Zabet, 2012) and is also listed in Table 6.



B Measuring the occupancy in the simulations 

Figure 7 plots the distribution of the logarithm of the ratio between the time and the ensemble averages
for the strongest 577 sites. One can observe that by increasing the simulation time, both the time average
and the hybrid average deviate less from the ensemble average. Furthermore, Figure 7 confirms that the
hybrid average performed using 40 independent replicates, each simulated for 3000 s, is a good estimate
for the ensemble average.
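The three averages can be illustrated with a toy example. The sketch below is not the simulation code; it assumes that the hybrid average is the mean of the per-replicate time averages, and it uses hypothetical, statistically independent 0/1 occupancy traces (real traces are correlated in time), with an assumed stationary probability `pBound` of the site being bound.

```r
set.seed(1)

# Hypothetical occupancy traces: rows = replicates, columns = time steps,
# entries are 1 when the site is bound and 0 otherwise.
nReplicates <- 40
nSteps <- 3000
pBound <- 0.2   # assumed stationary probability of the site being bound
bound <- matrix(rbinom(nReplicates * nSteps, size = 1, prob = pBound),
                nrow = nReplicates)

timeAverage <- rowMeans(bound)            # one value per replicate
ensembleAverage <- mean(bound[, nSteps])  # across replicates at a single time point
hybridAverage <- mean(timeAverage)        # time averages pooled over replicates

print(c(ensemble = ensembleAverage, hybrid = hybridAverage))

# Logarithm of the ratio between each time average and the pooled estimate;
# values close to 0 indicate that the two averages agree.
print(summary(log(timeAverage / hybridAverage)))
```

In this toy setting the hybrid average pools many more observations than a single-time-point ensemble average over the same 40 replicates, which is why it gives the more stable estimate.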

C System size reduction accuracy 

The association rate model required the determination of the actual time spent on the DNA. The
proportion of time the lacI molecules spend on the DNA varied if the association rate was fixed to
k^assoc_lacI = 2400 s^-1, while the percentage of the covered DNA was raised by increasing both the abundance
and association rate of non-cognate TFs. The values of the proportion of time the lacI molecules spend
on the DNA are plotted in Figure 8.
Figure 7. Effect of simulation time on ergodicity. Comparing the time average to the ensemble average
for various abundances of cognate and non-cognate molecules. The system consists of 1 Kbp of DNA which
contains the O1 site. There are three cases with respect to the amounts of TFs: (i) 1 lacI molecule and
0 non-cognates, (ii) 20 lacI molecules and 0 non-cognates and (iii) 1 lacI molecule and 20 non-cognates.
In addition, we considered three values for the simulation time when computing the time and hybrid
averages: (i) 100 s, (ii) 3000 s and (iii) 10000 s. (A), (B) and (C): the boxplots represent the mean of the
logarithm of the ratio between the time average and the ensemble average over 40 replicates. A value of 0
indicates that the time average is equal to the ensemble average. (D), (E) and (F): the boxplots represent
the standard deviation of the logarithm of the ratio between the time average and the ensemble average
over 40 replicates. The sites that have a binding energy lower than 30% of the highest value (423 sites)
were removed. By increasing the simulation time, both the mean and the standard deviation of the
logarithm of the ratio between the time average and the ensemble average tend to 0, showing that a
longer simulation time leads to smaller differences between time and ensemble averages.

Figure 8. The proportion of time the lacI molecules spend bound to the DNA in the full system, when
the crowding on the DNA is altered by changing the abundance and association rate of non-cognate TFs.
We performed a set of 20 simulations of the full system each lasting: (i) 3 s for 1 lacI, (ii) 2 s for 10
lacI, (iii) 1 s for 100 lacI and (iv) 1 s for 1000 lacI. The shaded area indicates values that are biologically
plausible.
| parameter | lacI | non-cognate | notation |
|---|---|---|---|
| copy number | see main manuscript | see main manuscript | TF_x |
| motif sequence | see Table 6 | | |
| energetic penalty for mismatch | 1 K_B T | 13 K_B T | ε* |
| nucleotides covered on left | 0 bp | 23 bp | |
| nucleotides covered on right | 0 bp | 23 bp | TF^right_x |
| association rate to the DNA | see main manuscript | see main manuscript | k^assoc_x |
| unbinding probability | 0.001474111 | 0.001474111 | P^unbind_x |
| probability to slide left | 0.4992629 | 0.4992629 | P^left_x |
| probability to slide right | 0.4992629 | 0.4992629 | P^right_x |
| probability to dissociate completely when unbinding | 0.1675 | 0.1675 | P^jump_x |
| time bound at the target site | 1.18E-6 s | 0.3314193 s | τ^0_x |
| the size of a step to left | 1 bp | 1 bp | |
| the size of a step to right | 1 bp | 1 bp | |
| variance of repositioning distance after a hop | 1 bp | 1 bp | σ^2_hop |
| the distance over which a hop becomes a jump | 100 bp | 100 bp | d_jump |
| the proportion of prebound molecules | 0.0 | 0.9 | |
| affinity landscape roughness | | 1.0 K_B T | |

Table 5. TF species default parameters



Furthermore, it is important to test whether the one dimensional statistics (sliding length and
residence time) are affected by increasing the number of non-cognate TFs. Figure 9 shows that, for
biologically plausible values of the proportion of covered DNA (between 10% and 50%), the sliding
length and the residence time deviate only negligibly from the values that were estimated previously
(Zabet and Adryan, 2012a).



D The dependence of absolute occupancy on TF competition 

Figure 10 shows that the absolute SDO (not normalised to the maximum value) is not significantly
affected by crowding on the DNA, but strongly depends on the abundance of lacI molecules.

E The average number of bound lacI molecules

Figure 11 confirms that there is a reduction in the number of bound lacI molecules when the crowding
on the DNA is increased by adding more non-cognate molecules. This is valid for all lacI abundances.

F Significant difference between SDO and ADO

Figure 12 shows that the sites where SDO differs significantly from ADO are medium and low affinity
sites.
| Position | A | C | G | T |
|---|---|---|---|---|
| 1 | 0.6200 | -0.6900 | 0.1400 | -0.6900 |
| 2 | 0.6200 | -0.6900 | 0.1400 | -0.6900 |
| 3 | 0.1600 | 0.1400 | -0.6900 | 0.1800 |
| 4 | 0.1600 | -0.6900 | -0.6900 | 0.6200 |
| 5 | -0.7000 | -0.7000 | 0.9000 | -0.7000 |
| 6 | -0.6900 | -0.6900 | -0.6900 | 0.9300 |
| 7 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 8 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 9 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 10 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 11 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 12 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 13 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 14 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 15 | 0.0077 | -0.0084 | -0.0073 | 0.0083 |
| 16 | 0.6200 | -0.6900 | 0.1400 | -0.6900 |
| 17 | -0.7000 | 0.9000 | -0.7000 | -0.7000 |
| 18 | 0.9300 | -0.6900 | -0.6900 | -0.6900 |
| 19 | 0.9300 | -0.6900 | -0.6900 | -0.6900 |
| 20 | -0.6900 | 0.1400 | -0.6900 | 0.6200 |
| 21 | -0.6900 | 0.1400 | -0.6900 | 0.6200 |

Table 6. lacI PWM
Figure 9. One dimensional statistics for various levels of non-cognate TFs (panels: sliding length (bp),
observable sliding length (bp) and residence time (ms)). We performed a set of
X = 20 simulations of the 100 Kbp subsystem each lasting 3000 s, using the parameters presented
in the main manuscript and the parameters from Table 5.
[Figure 10 legends: lacI copies 1, 10, 100, 1000, 10000; crowding 0%, 9%, 26%, 42%, 55%; x-axis: log(normalised ADO)]

Figure 10. ADO and SDO for various abundances of lacI and crowding on the DNA. This
is the same as Figure 3 in the main manuscript, except that the SDO was not normalised.



G Generating the in silico ChIP profile 

The R code that generates the in silico ChIP profile (see below) is an implementation of the method
described in (Kaplan et al., 2011).



Figure 11. The average number of bound molecules for various crowding levels and various lacI
abundances (panels: 1, 10, 100 and 1000 lacI molecules; x-axis: % of covered DNA). We performed a set of
X = 40 simulations of the 100 Kbp subsystem each lasting 3000 s,
using the parameters presented in the main manuscript and the parameters from Table 5.

Figure 12. Significant deviations between ADO and SDO. This is the same as Figure 3 in the
main manuscript, except that in this Figure we did not consider any affinity cut-off and plotted only sites
where the occupancy in the simulations is at least 2.1 times higher than that predicted by the affinity.
The number in the parentheses in the legend represents the total number of sites that display an SDO
at least 2.1 times higher than the ADO for each particular case. In each panel, the abundance of lacI
is kept constant and the crowding on the DNA is increased from 0% to 55%. The level of crowding on
the DNA (implemented through the abundance of non-cognate TF) influences the number of sites that
display significant differences between occupancy and affinity. We considered four cases with respect to
the number of lacI molecules: (A) 1, (B) 10, (C) 100 and (D) 1000.

generateChIPProfile <- function(input.vec, mean, sd, smooth = NULL) {
  # Gamma distribution parameters obtained from the mean and standard deviation
  var = sd^2
  shp = mean^2/var
  scl = var/mean
  l = length(input.vec)

  # Gamma density over distances 0..l and its upper cumulative sum
  f = dgamma(0:length(input.vec), shape = shp, scale = scl)
  F = rev(cumsum(rev(f)))

  # Peak centres are the positions with above-average occupancy
  # (mean() still resolves to the base function although 'mean' is also an argument)
  peak.centres = which(input.vec > mean(input.vec))
  peaks = vector("numeric", l)

  for(pc in peak.centres) {
    this.peak = vector("numeric", l)
    # Right flank (including the centre)
    this.peak[pc:l] = F[1:(l-pc+1)]
    # Left flank, mirrored; skipped when the centre is the first position
    if (pc > 1) {
      this.peak[1:(pc-1)] = F[pc:2]
    }
    peaks = peaks + this.peak * input.vec[pc]
  }

  # Optional smoothing: running mean over positions i-d..i+d
  # (smooth is first made odd)
  if (!is.null(smooth)) {
    if ((smooth %% 2) == 0) {smooth = smooth - 1}
    mid = round(smooth/2, 0) + 1
    d = smooth - mid
    for(i in mid:(length(peaks) - d)) {
      peaks[i] = mean(peaks[max(1, (i-d)):min(length(input.vec), (i+d))])
    }
  }

  return(peaks)
}
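The first lines of the function convert a mean and a standard deviation into the shape and scale of a gamma distribution by moment matching. The short check below illustrates this step in isolation; the helper name `gammaFromMoments` and the numerical values are hypothetical, and we assume here that the gamma-distributed quantity is a DNA fragment length in bp, which the listing itself does not state.

```r
# Moment matching for a gamma distribution, as in the first lines of the
# function above: shape = mean^2 / var, scale = var / mean.
gammaFromMoments <- function(mean, sd) {
  v <- sd^2
  list(shape = mean^2 / v, scale = v / mean)
}

# Hypothetical fragment-length statistics (in bp); illustrative values only.
p <- gammaFromMoments(mean = 200, sd = 50)

# Recover the moments from the shape and scale:
# mean = shape * scale, variance = shape * scale^2.
recoveredMean <- p$shape * p$scale
recoveredVar  <- p$shape * p$scale^2
print(c(shape = p$shape, scale = p$scale,
        mean = recoveredMean, sd = sqrt(recoveredVar)))
```

Because the conversion is exact, the recovered mean and standard deviation equal the requested ones for any positive inputs.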



H Lower information content motifs 

Our lacI motif has an information content of 16.9 bits. Hence, in order to test what the switching limit is,
we removed one base pair at a time from the lacI motif and produced six new lower information content
motifs; see Figure 13.
Figure 13. Lower information content lacI motifs. The information content of the reduced motifs is:
(i) I_lacI1 = 15.8 bits, (ii) I_lacI2 = 14.7 bits, (iii) I_lacI3 = 12.7 bits, (iv) I_lacI4 = 10.7 bits, (v) I_lacI5 = 8.7 bits
and (vi) I_lacI6 = 7.7 bits; see Figure 14.

Figure 14. Information content of the reduced lacI motifs (x-axis: base pairs removed from motif, 1 to 6).